Chapter 8 Principles of Feature Engineering and Selection

8.3 Feature scaling via standard normalization

In this Section we discuss the benefits of normalizing input features - also called feature scaling - via the standard normalization scheme (i.e., subtracting off the mean and dividing off the standard deviation of each feature) for linear two-class classification, in terms of how drastically it improves training speed when using first order optimization methods like gradient descent. This closely mirrors the discussion of the same topic for linear regression in Section 8.4. This 'optimization trick' will also prove especially important when we deal with nonlinear supervised models in the future - e.g., deep networks - where training via first order methods is especially popular.

In [ ]:
# This code cell will not be shown in the HTML version of this notebook
# import custom library
import sys
sys.path.append('../../')
from mlrefined_libraries import superlearn_library as superlearn
from mlrefined_libraries import math_optimization_library as optlib
datapath = '../../mlrefined_datasets/superlearn_datasets/'

# demos for this notebook
classif_plotter = superlearn.lin_classification_demos
optimizers = optlib.optimizers
classification_plotter = superlearn.classification_static_plotter.Visualizer();
feature_scaling_tools = superlearn.feature_scaling_tools
static_plotter = optlib.static_plotter.Visualizer()
cost_lib = superlearn.cost_functions
normalizers = superlearn.normalizers 

# import autograd functionality to build functions properly for optimizers
import autograd.numpy as np

# import timer
from datetime import datetime 

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

8.3.1 Feature scaling for single input datasets

Below we load in a simple two-class dataset with $N=1$ input feature. A quick glance at the data suggests that - if tuned properly - a two-class linear classifier will perform very well on this dataset, as the data appears to be largely separable by a zero-dimensional hyperplane (i.e., a single point on the input axis).

In [6]:
# This code cell will not be shown in the HTML version of this notebook
# load in dataset
data = np.loadtxt(datapath + '2d_classification_data_v1.csv',delimiter = ',')

# get input/output pairs
x = data[:-1,:] + 4
y = data[-1:,:] 
y[0,-1] = -1

# reform data with adjustments
data = np.vstack((x,y))

# plot dataset
demo = classif_plotter.Visualizer(data)
demo.plot_data()

Since we need only properly tune two parameters in learning a linear classifier for this dataset, we can visualize any classification cost function over it. Here we will use the softmax cost. Below we load in this cost function from a backend file - containing precisely the softmax cost implemented in Section 9.1 - and then show its contour plot over the above dataset.

The contours of this function are extremely long and narrow - and so we can predict that gradient descent will struggle immensely (zig-zagging very slowly - as we first described in Section 6.4, and discussed again in Section 8.4) in determining the global minimum (located inside the smallest contour shown below).

In [8]:
# This code cell will not be shown in the HTML version of this notebook
# load in cost function
softmax = cost_lib.choose_cost(x,y,'softmax')
counting_cost = cost_lib.choose_cost(x,y,'counter')

# show the contours of an input function over a desired viewing range
static_plotter.two_input_contour_plot(softmax,[],xmin = -100,xmax = 100,ymin = -30,ymax = 30,num_contours = 5,show_original = False)

We can confirm this intuition regarding gradient descent by making a run - which we do below. Beginning at the point $\mathbf{w} = \begin{bmatrix} 20 \\ 30 \end{bmatrix}$ we visualize $100$ steps of gradient descent with a fixed steplength $\alpha = 1$. This was the largest steplength of the form $10^{-\gamma}$, with $\gamma$ an integer, that we found did not cause gradient descent to diverge here.

In [10]:
# This code cell will not be shown in the HTML version of this notebook
# load in an optimizer
g = softmax; w = np.array([20.0,30.0])[:,np.newaxis]; max_its = 100; alpha_choice = 1;
weight_history_1,cost_history_1 = optimizers.gradient_descent(g,alpha_choice,max_its,w)
count_history_1 = [counting_cost(v) for v in weight_history_1]  # compute misclassification history

# show run on contour plot
static_plotter.two_input_contour_plot(softmax,weight_history_1,xmin = -100,xmax = 100,ymin = -30,ymax = 40,num_contours = 5,show_original = False)

Here as usual the steps are colored from green to red as gradient descent begins (green) to when it ends (red). From this perspective we can see that we still have a long way to travel to reach the minimum of the cost function, and that these steps will likely continue to zig-zag considerably.

Plotting the logistic sigmoid (in red) and associated counting cost (in blue) given by the final set of weights learned in this run of gradient descent - those associated with the final red point plotted on the contour plot above - we can see that because these weights lie so far from the true minimum of the cost function, the classifier's quality suffers immensely - we learn a poor classifier here.

In [11]:
# This code cell will not be shown in the HTML version of this notebook
# the original data and the best learned logistic sigmoid and counting cost
# line learned from our gradient descent run above
ind = np.argmin(count_history_1)
best_weights = weight_history_1[ind]
demo.plot_fit(best_weights)

In Section 8.4 we saw how normalizing the input drastically improved the topology of linear regression cost functions, making it much easier for gradient descent (as well as other local optimization methods) to find global minima and hence learn a proper linear regressor. Here we will see how input normalization - or feature scaling - using the same standard normalization scheme provides precisely the same benefit in the context of two-class linear classification.

Thus we will normalize our input by subtracting off its mean and dividing by its standard deviation - precisely as we did in Section 8.4 with linear regression. This means we replace each input point $x_p$ with its mean-centered, unit-deviation analog as

\begin{equation} x_p \longleftarrow \frac{x_p - \mu}{\sigma} \end{equation}

where the sample mean of the inputs $\mu$ is defined as

\begin{equation} \mu = \frac{1}{P}\sum_{p=1}^{P}x_p \end{equation}

and the sample standard deviation of the inputs $\sigma$ is defined as

\begin{equation} \sigma = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_p - \mu \right)^2}. \end{equation}

As we will see below for the particular dataset we are currently studying, this simple normalization 'trick' has a profound impact on the shape of our cost function. Also note: this normalization scheme is invertible, meaning that after performing it we can always return to our original data by simply re-multiplying a normalized input by the original standard deviation and adding the original mean.
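As a concrete illustration of this invertibility, below is a minimal stand-alone sketch of the scheme. The function names `standard_normalize` and `invert_normalization` and the toy data are purely illustrative - this is not the backend `normalizers` implementation.

```python
import numpy as np

def standard_normalize(x):
    # mean-center and scale a 1 x P input array, returning the statistics
    # needed to invert the transformation (a hypothetical stand-alone
    # version, not the backend normalizer)
    mu = np.mean(x)
    sigma = np.std(x)
    return (x - mu) / sigma, mu, sigma

def invert_normalization(x_norm, mu, sigma):
    # undo the normalization: re-multiply by sigma and add back mu
    return x_norm * sigma + mu

# toy single-feature input
x_toy = np.array([[1.0, 3.0, 5.0, 7.0]])
x_toy_norm, mu, sigma = standard_normalize(x_toy)

# the scheme is invertible: we recover the original data exactly
print(np.allclose(invert_normalization(x_toy_norm, mu, sigma), x_toy))  # True
```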

We developed Python functionality to perform this normalization in Section 8.4; below we load this scheme in from a backend file called normalizers and employ it.

Let's now examine the contours of the softmax cost function using the normalized input. As we can see below, the contours of this cost function are drastically improved.

In [13]:
# This code cell will not be shown in the HTML version of this notebook
# create normalizer
normalizer = normalizers.standard_normalizer(x)

# normalize input
x_normalized = normalizer(x)

# define new cost with normalized data
softmax_2 = cost_lib.choose_cost(x_normalized,y,'softmax')

# show the contours of an input function over a desired viewing range
static_plotter.two_input_contour_plot(softmax_2,[],xmin = -25,xmax = 23,ymin = -11,ymax = 38,num_contours = 5,show_original = False)

Below we show an animation where we form a sequence of softmax cost functions using a convex combination of the original and normalized data

\begin{equation} \left(1 - \lambda\right)x_p + \lambda \left( \frac{x_p - \mu}{\sigma} \right) \end{equation}

where $\lambda$ ranges from $0$ (where we use the original input) to $\lambda = 1$ (where we use the normalized version). Plotting the contour of each softmax cost for $50$ evenly spaced values of $\lambda$ between $0$ and $1$ shows how the original softmax cost function is transformed by normalizing the input. You can use the slider below to transition between the contours of the original cost function (when the slider is all the way to the left) and the cost function taking in normalized input (when the slider is all the way to the right).
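The blending used by the animation can be sketched in a few lines of numpy; the toy input and variable names here are illustrative only.

```python
import numpy as np

# blend a toy input with its normalized version via the convex
# combination (1 - lambda) x_p + lambda (x_p - mu)/sigma
x_toy = np.array([[2.0, 4.0, 6.0, 8.0]])
mu, sigma = np.mean(x_toy), np.std(x_toy)
x_toy_norm = (x_toy - mu) / sigma

for lam in np.linspace(0, 1, 5):
    x_blend = (1 - lam) * x_toy + lam * x_toy_norm
    print(lam, x_blend)
# lam = 0 recovers the original input, lam = 1 its normalized version
```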

In [10]:
# This code cell will not be shown in the HTML version of this notebook
# animation showing cost function transformation from standard to normalized input
scaling_tool = feature_scaling_tools.Visualizer(x,x_normalized,y,'softmax')
scaling_tool.animate_transition(num_frames=50,xmin = -25,xmax = 23,ymin = -11,ymax = 38,num_contours = 5)

Now we make a run of gradient descent to find an approximate global minimum of this softmax cost. We initialize at precisely the same point as previously, and use the largest steplength of the form $10^{-\gamma}$ possible, which in this instance is $\alpha = 10$. Whenever we normalize the input like this we can use a larger fixed steplength value. Here - however - we will use only $25$ steps (one-fourth as many as we ran above in minimizing the softmax with un-normalized data). Nonetheless, even with so few steps - as can be seen below - we easily minimize the softmax cost and find an approximate global minimum.

In [15]:
# This code cell will not be shown in the HTML version of this notebook
# load in an optimizer
g = softmax_2; w = np.array([20.0,30.0])[:,np.newaxis]; max_its = 25; alpha_choice = 10;
weight_history_2,cost_history_2 = optimizers.gradient_descent(g,alpha_choice,max_its,w)
count_history_2 = [counting_cost(v) for v in weight_history_2]  # compute misclassification history

# show the contours of the cost function over a desired viewing range
static_plotter.two_input_contour_plot(softmax_2,weight_history_2,xmin = -25,xmax = 23,ymin = -11,ymax = 38,num_contours = 5,show_original = False)

Using only a quarter of the number of descent steps - with precisely the same setup as previously - we get much closer to the cost function minimum!

Let us plot the logistic and counting cost fits associated with the final set of weights (the final red point above) on our original dataset below. Notice - just as with linear regression - that in order to make this plot we must treat each new input point pushed through the predictor precisely as we treated our original input: i.e., we must subtract off the same mean and divide off the same standard deviation. Thus with our fully tuned parameters $w_0^{\star}$ and $w_1^{\star}$ our predictor for any input point $x$ is, instead of $ w_0^{\star} + w_1^{\star}x^{\,}$, now

\begin{equation} \text{normalized\_model}\left(x\right) = \text{tanh}\left(w_0^{\star} + w_1^{\star}\left(\frac{x - \mu}{\sigma}\right)\right). \end{equation}

Again - since we normalized the input data we trained on, we must normalize any new input point we shove through our trained logistic model.
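A minimal sketch of this evaluation is shown below; the weights and training statistics used here are illustrative placeholders, not the values learned above.

```python
import numpy as np

def normalized_model(x, w0, w1, mu, sigma):
    # evaluate the trained model on a raw input by applying the same
    # normalization used during training (weights and statistics are
    # hypothetical placeholders, not values learned in the text)
    return np.tanh(w0 + w1 * (x - mu) / sigma)

# with illustrative statistics mu = 4, sigma = 2 and weights w0 = 0, w1 = 1,
# the raw input x = 4 sits exactly on the decision boundary
print(normalized_model(4.0, 0.0, 1.0, 4.0, 2.0))  # 0.0
```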

The final logistic predictor and counting cost - plotted below in red and blue respectively - are far superior to the ones we found previously on the un-normalized input data, where we took $4$ times as many gradient descent steps.

In [16]:
# This code cell will not be shown in the HTML version of this notebook
# the original data and the best learned logistic sigmoid and counting cost
# line learned from our gradient descent run above
ind = np.argmin(count_history_2)
best_weights = weight_history_2[ind]
demo.plot_fit(best_weights,transformer = normalizer)

8.3.2 Feature scaling for multi-input datasets

As with linear regression, in two-class classification when dealing with $N$ dimensional input datasets we can normalize each input feature precisely as we did above, and gain the same sort of benefit in terms of speeding up gradient descent. This means we mean-center and divide off the standard deviation along each input axis, replacing each coordinate of our input data as

\begin{equation} x_{p,n} \longleftarrow \frac{x_{p,n} - \mu_n}{\sigma_n} \end{equation}

where $x_{p,n}$ is the $n^{th}$ coordinate of the $p^{th}$ input point and $\mu_n$ and $\sigma_n$ are the mean and standard deviation of the $n^{th}$ dimension of the data, respectively, and are defined as

\begin{align} \mu_n &= \frac{1}{P}\sum_{p=1}^{P}x_{p,n} \\ \sigma_n &= \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_{p,n} - \mu_n \right)^2}. \end{align}

Generally speaking, the only time this kind of standard normalization is not helpful is when the input along a certain dimension of a dataset is constant, and hence has zero standard deviation, i.e., $\sigma_n = 0$. Scaling by this value would mean dividing by $0$, which must always be avoided. When some input dimension of a dataset is constant it can either be thrown out - as is often the case with metadata datasets, since such an input feature does not help discriminate between classes - or the entire input requires pre-processing (as is the case with image-based data) that appropriately adjusts the values along such an input dimension, so that we avoid dividing by zero when normalizing it.
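One simple way to guard against constant dimensions is sketched below; the function name and toy data are hypothetical, and this is not the backend `normalizers` implementation.

```python
import numpy as np

def standard_normalize_features(x):
    # normalize each row (feature dimension) of an N x P input array;
    # dimensions with zero standard deviation are mean-centered only,
    # avoiding division by zero (a sketch, not the backend normalizer)
    mu = np.mean(x, axis=1, keepdims=True)
    sigma = np.std(x, axis=1, keepdims=True)
    safe_sigma = np.where(sigma > 0, sigma, 1.0)   # guard constant features
    return (x - mu) / safe_sigma

# a 2 x P toy dataset whose second feature is constant
x_toy = np.array([[1.0, 2.0, 3.0, 4.0],
                  [5.0, 5.0, 5.0, 5.0]])
x_norm = standard_normalize_features(x_toy)
print(x_norm)  # second row becomes all zeros instead of dividing by zero
```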

Below we will compare runs of gradient descent on the original and normalized input of a real $N = 8$ input breast cancer dataset, a description of which you can find here. Below we load in the data and run gradient descent, then normalize the input and do the same. Afterwards we compare the two runs.

Below we run gradient descent for $100$ iterations, using $\alpha = 10^{-1}$ - the largest steplength parameter of the form $10^{-\gamma}$ that produced convergence.

In [19]:
# This code cell will not be shown in the HTML version of this notebook
# load in dataset
data = np.loadtxt(datapath + 'breast_cancer_data.csv',delimiter = ',')

# get input/output pairs
x = data[:-1,:]
y = data[-1:,:] 

# load in cost function
softmax = cost_lib.choose_cost(x,y,'softmax')
counting_cost = cost_lib.choose_cost(x,y,'counter')

# load in an optimizer
g = softmax; w = 0.1*np.random.randn(x.shape[0]+1,1); max_its = 100; alpha_choice = 10**(-1);
weight_history_1,cost_history_1 = optimizers.gradient_descent(g,alpha_choice,max_its,w)
count_history_1 = [counting_cost(v) for v in weight_history_1]

Now we normalize the input and create corresponding softmax and counting cost functions, and then perform the same run using the same initial point and number of descent steps.

In [21]:
# This code cell will not be shown in the HTML version of this notebook
# create normalizer
normalizer = normalizers.standard_normalizer(x)

# normalize input
x_normalized = normalizer(x)

# make new costs for normalized data
softmax_2 = cost_lib.choose_cost(x_normalized,y,'softmax')
counting_cost_2 = cost_lib.choose_cost(x_normalized,y,'counter')

# load in an optimizer
g = softmax_2; alpha_choice = 1;
weight_history_2,cost_history_2 = optimizers.gradient_descent(g,alpha_choice,max_its,w)
count_history_2 = [counting_cost_2(v) for v in weight_history_2]

Below we plot the cost and misclassification histories from our two runs of gradient descent. As can be seen, the run on the normalized input (in magenta) converges much faster.

In [22]:
# This code cell will not be shown in the HTML version of this notebook
# plot the cost function history for a given run
classification_plotter.plot_cost_histories([cost_history_1,cost_history_2],[count_history_1,count_history_2],start = 0,labels = ['standard','normalized'])

Note that to evaluate any new test point $\mathbf{x}$ - using our fully tuned parameters $w_0^{\star},w_1^{\star},\ldots,w_N^{\star}$ - we must treat it as we did our training data, normalizing each input feature using the same statistics we computed on the training data above.

\begin{equation} \text{normalized\_model}\left(\mathbf{x}\right) = \text{tanh}\left(w_0^{\star} + w_1^{\star}\left(\frac{x_1 - \mu_1}{\sigma_1}\right) + w_2^{\star}\left(\frac{x_2 - \mu_2}{\sigma_2}\right) + \cdots + w_N^{\star}\left(\frac{x_N - \mu_N}{\sigma_N}\right)\right). \end{equation}
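This evaluation can be sketched directly in numpy; the weights and training statistics below are illustrative placeholders for $N = 3$ features, not values learned above.

```python
import numpy as np

def normalized_model(x, w, mu, sigma):
    # evaluate tanh(w_0 + sum_n w_n (x_n - mu_n)/sigma_n) on a raw test
    # point x of length N; all values below are hypothetical placeholders
    x_norm = (x - mu) / sigma
    return np.tanh(w[0] + np.dot(w[1:], x_norm))

# hypothetical tuned weights and training statistics for N = 3 features
w_toy = np.array([0.5, 1.0, -2.0, 0.25])
mu_toy = np.array([10.0, 0.0, 3.0])
sigma_toy = np.array([2.0, 1.0, 0.5])
print(normalized_model(np.array([12.0, 1.0, 3.5]), w_toy, mu_toy, sigma_toy))
```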

10.4 Feature scaling via standard normalization

We have now seen multiple times how feature scaling via the standard normalization scheme - that is subtracting the mean and dividing off the standard deviation of each input feature - substantially improves our ability to properly tune linear regression and two-class classification cost functions when employing gradient descent and coordinate descent methods (see Sections 8.4 and 9.4). Unsurprisingly standard normalization provides the same benefits in the case of multiclass classification - whether we employ the One-versus-all framework or the multiclass perceptron / softmax cost - which we explore here.

10.4.1 One-versus-All classification

As we saw in Section 9.1, One-versus-All (OvA) multiclass classification applied to a dataset with $C$ classes consists of applying $C$ two-class classifiers to the dataset - each assigned to learn a linear separator between a single class and all the other data - and combining the results. Because we have already seen the benefit of using standard normalization in the context of two-class classification (in Section 9.4), clearly employing the same normalization scheme in the context of OvA will provide the same benefits. Here we illustrate this fact using a simple single-input multiclass dataset, loaded in and shown below.
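As a rough sketch of the fusion rule underlying OvA - assigning each input the class whose linear classifier evaluates largest - consider the following. The weights here are hand-set for illustration only, not trained.

```python
import numpy as np

def ova_predict(x, W):
    # fusion rule sketch: W is an (N+1) x C array whose c-th column holds
    # the bias and slope(s) of the c-th two-class classifier; each input
    # is assigned the class with the largest linear evaluation
    evals = W[0, :] + x.T @ W[1:, :]   # P x C linear evaluations
    return np.argmax(evals, axis=1)

# three hand-set single-input classifiers (C = 3) - hypothetical weights
W_toy = np.array([[0.0, -2.0, -8.0],
                  [0.0,  2.0,  4.0]])
x_toy = np.array([[0.5, 1.5, 3.5]])
print(ova_predict(x_toy, W_toy))  # → [0 1 2]
```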

In [2]:
# This code cell will not be shown in the HTML version of this notebook
# load in a toy dataset to illustrate the benefits of standard normalization 
data = np.loadtxt(datapath + 'singleinput_multiclass_dataset.csv',delimiter = ',')

# get input/output pairs
x = data[:-1,:]
y = data[-1:,:] 

# multiclass single-input plotter
demo = classif_plotter.Visualizer(data)
demo.plot_data()

Below we train an OvA multiclass classifier to represent / separate the $C = 4$ classes of this dataset. For each of the $C$ two-class subproblems we use the softmax cost function, $\alpha = 10^{-1}$ (the largest fixed steplength of the form $10^{-\gamma}$ with $\gamma$ an integer we could find that produced consistent convergence for this example), and $100$ iterations. Each subproblem is initialized at the point $\mathbf{w} = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$.

Below we plot each gradient descent run on its corresponding cost function contour, labeling the axes of each contour plot with its respective weights. Here we can see how the contours of every subplot are highly elliptical, making it extremely difficult for gradient descent to make progress towards a good solution.

In [5]:
# This code cell will not be shown in the HTML version of this notebook
# load in OvA framework
ova = superlearn.one_versus_all

# run OvA on dataset
w = np.array([[1],[4]]); alpha_choice = 10**(-1); max_its = 100;
weight_history_1, count_history_1 = ova.train(x,y,max_its = max_its,alpha_choice = alpha_choice,w = w)

# show run of each sub-problem on respective cost function contour
ova_static_plotter.two_input_contour_plot(weight_history_1,x,y,xmin = -10,xmax = 10,ymin = -7,ymax = 7,num_contours = 5)

Plotting the resulting fit to the data - which we do below - we can see just how poorly we have learned a reasonable representation of this dataset, due to the highly skewed contours of its cost functions. Notice that the fit shown is produced by evaluating the fusion rule described in the previous Sections of this Chapter over a fine range of inputs across the input space of the dataset.

In [6]:
# This code cell will not be shown in the HTML version of this notebook
# plot the best learned representation from our run above
ind = np.argmin(count_history_1)
best_weights = weight_history_1[ind]
demo.plot_fit(best_weights)

Let's now compare this to a run of the same general format, but where we employ standard normalization first. That is - since this is a single-input dataset - we replace each input by subtracting off the mean of the entire set of inputs and dividing off its standard deviation as

\begin{equation} x_p \longleftarrow \frac{x_p - \mu}{\sigma} \end{equation}

where the sample mean of the inputs $\mu$ is defined as

\begin{equation} \mu = \frac{1}{P}\sum_{p=1}^{P}x_p \end{equation}

and the sample standard deviation of the inputs $\sigma$ is defined as

\begin{equation} \sigma = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_p - \mu \right)^2}. \end{equation}

We produced Python functionality for performing this standard normalization in Section 8.4, which we load in from a backend file called normalizers and apply to our current dataset. Afterwards we re-run OvA training for just $10$ steps of gradient descent initialized at the same point as above, using a fixed steplength of $\alpha = 10$ in each instance (again, the largest steplength value of this form we could use that produced convergence). Notice - as is always the case after normalizing the input of a dataset - that here we can use far fewer steps and a much larger steplength value.

As we can see by analyzing the contour plots below, performing standard normalization unsurprisingly greatly improves the contours of each sub-problem cost function, making them much easier to minimize.

In [10]:
# This code cell will not be shown in the HTML version of this notebook
# create normalizer
normalizer = normalizers.standard_normalizer(x)

# normalize input
x_normalized = normalizer(x)

# run one-versus-all 
weight_history_2, count_history_2 = ova.train(x_normalized,y,max_its = 10,alpha_choice = 10,w = w)

# show run on contour plot
ova_static_plotter.two_input_contour_plot(weight_history_2,x_normalized,y,xmin = -14,xmax = 14,ymin = -7,ymax = 7,num_contours = 5)

Examining the corresponding fit - plotted below - we see significantly better performance. Note - as is always the case - that in order to evaluate new test points (e.g., to produce the plotted fit shown below) we must normalize them precisely as we did the training data.

In [11]:
# This code cell will not be shown in the HTML version of this notebook
# plot the best learned representation from our run above
ind = np.argmin(count_history_2)
best_weights = weight_history_2[ind]
demo.plot_fit(best_weights,transformer = normalizer)

10.4.2 Simultaneous multiclass frameworks

Standard normalization also substantially improves training with simultaneous multiclass frameworks - the multiclass perceptron and softmax cost functions. As a simple example illustrating this fact, we first examine how well the multiclass perceptron fits the same toy dataset used in the previous Subsection - before and after normalizing its input.
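For reference, the multiclass perceptron cost can be sketched in a few lines of numpy. This is a hypothetical stand-alone version, not the backend `cost_lib` implementation, and labels are assumed to be integers 0,...,C-1.

```python
import numpy as np

def multiclass_perceptron_cost(W, x, y):
    # average multiclass perceptron cost: for each point, the maximum
    # linear evaluation over all classes minus the evaluation of the
    # true class; W is (N+1) x C, x is N x P, y holds labels 0..C-1
    evals = W[0, :][:, np.newaxis] + W[1:, :].T @ x   # C x P evaluations
    true_evals = evals[y.astype(int), np.arange(x.shape[1])]
    return np.mean(np.max(evals, axis=0) - true_evals)

# a tiny single-input, two-class example with hand-set weights
x_toy = np.array([[-1.0, 1.0]])
y_toy = np.array([0, 1])
W_toy = np.array([[0.0, 0.0],
                  [-1.0, 1.0]])   # class 0 scores highest for x < 0
print(multiclass_perceptron_cost(W_toy, x_toy, y_toy))  # 0.0 - perfectly separated
```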

In [12]:
# This code cell will not be shown in the HTML version of this notebook
# load in a toy dataset to illustrate the benefits of standard normalization 
data = np.loadtxt(datapath + 'singleinput_multiclass_dataset.csv',delimiter = ',')

# get input/output pairs
x = data[:-1,:]
y = data[-1:,:] 

# multiclass single-input plotter
demo = classif_plotter.Visualizer(data)
demo.plot_data()

Now we run $100$ steps of gradient descent with a fixed steplength of $10^{-4}$, the largest steplength of the form $10^{-\gamma}$ with $\gamma$ an integer we found to produce convergence. This will not result in significant learning, since we saw in Example 2 of the previous Section that this dataset required close to $10,000$ iterations to produce adequate learning.

Plotting the learned fusion rule model employing the best set of weights from this run we can indeed see that our model performs quite poorly.

In [17]:
# This code cell will not be shown in the HTML version of this notebook
# load in cost function
perceptron_1 = cost_lib.choose_cost(x,y,'multiclass_perceptron')
counting_cost_1 = cost_lib.choose_cost(x,y,'multiclass_counter')

# load in an optimizer
g = perceptron_1; w = np.array([[1.0,1.0,1.0,1.0],[4.0,4.0,4.0,4.0]]); max_its = 100; alpha_choice = 10**(-4);
weight_history_1,cost_history_1 = optimizers.gradient_descent(g,alpha_choice,max_its,w)
count_history_1 = [counting_cost_1(v) for v in weight_history_1]  # compute misclassification history

# the original data and the best learned logistic sigmoid and counting cost
# line learned from our gradient descent run above
ind = np.argmin(count_history_1)
best_weights = weight_history_1[ind]
demo.plot_fit(best_weights)

Now we normalize the input using standard normalization - as described in Sections 8.4, 9.4, and the subsection above.

With the input normalized we now make a run of gradient descent - much shorter and with a much larger steplength than used previously (as is typically possible when normalizing the input to a cost function). This results in considerably better performance after just a few steps, as we plot the result of the best found weights below as well.

In [20]:
# This code cell will not be shown in the HTML version of this notebook
# create normalizer
normalizer = normalizers.standard_normalizer(x)

# normalize input
x_normalized = normalizer(x)

# make new costs for normalized data
perceptron_2 = cost_lib.choose_cost(x_normalized,y,'multiclass_perceptron')
counting_cost_2 = cost_lib.choose_cost(x_normalized,y,'multiclass_counter')

# load in an optimizer
g = perceptron_2; max_its = 30; alpha_choice = 1;
weight_history_2,cost_history_2 = optimizers.gradient_descent(g,alpha_choice,max_its,w)
count_history_2 = [counting_cost_2(v) for v in weight_history_2]  # compute misclassification history

# the original data and the best learned logistic sigmoid and counting cost
# line learned from our gradient descent run above
ind = np.argmin(count_history_2)
best_weights = weight_history_2[ind]
demo.plot_fit(best_weights,transformer = normalizer)

If we compare the cost and misclassification count per iteration of both runs above we can see the startling difference between the two gradient descent runs. In particular, in the time it takes the run on standard normalized input to converge to a perfect solution, the unnormalized run fails to make any real progress at all.

In [21]:
# This code cell will not be shown in the HTML version of this notebook
# plot the cost function history for a given run
classification_plotter.plot_cost_histories([cost_history_1,cost_history_2],[count_history_1,count_history_2],start = 0,labels = ['unnormalized','normalized'])

© This material is not to be distributed, copied, or reused without written permission from the authors.